This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Cleaning data

Overview

This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.

HOW TO READ: To emphasize the tidyverse coding approach, each cleaning step is explained individually and then incorporated into a “cleaning pipeline” - a series of cleaning actions linked together sequentially through pipes (LINK TO PIPES). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).

The cleaning steps demonstrated include:

  • Loading the data
  • Variable name cleaning
  • Variable selection
  • Designating variable classes
  • Filtering rows
  • Re-coding values
  • Creating groups (case_when())
  • Dealing with character case (upper, lower, title, etc.)
  • Factor variables

replace missing with dealing with cases (all lower, etc) case_when() factors

Preparation

Load packages

pacman::p_load(tidyverse,  # data manipulation and visualization
               janitor,    # data cleaning
               epitrix,    # data cleaning
               )

Load data

Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)

linelist_raw <- import("ebola_simulated.xlsx")

You can view the original raw dataset below:

# display the linelist data as a table
DT::datatable(linelist_raw, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

A cleaning pipeline

Text here about a cleaning pipeline of dplyr verbs… order is important

Standardize variable name syntax

Variable names are used so often, it is best that they have “clean” syntax. We suggest the following:

  • short
  • no spaces (replaced with underscores (_),
  • no unusual characters (&, #…)
  • similar nomenclature (e.g. all dates like date_onset, date_report, date_death)

The names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).

names(linelist_raw)
##  [1] "case_id"         "generation"      "infection date"  "date onset"     
##  [5] "hosp date"       "date_of_outcome" "outcome"         "gender"         
##  [9] "hospital"        "lon"             "lat"             "infector"       
## [13] "source"          "age"             "age_unit"        "fever"          
## [17] "chills"          "cough"           "aches"           "vomit"
Note: To use a variable names that include spaces, surround the name with back-ticks, for example: linelist$`infection date` 
On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).

Automatic syntax cleaning

The function clean_names() from the package janitor is very useful. Here is an online vignette

# send the dataset through the function clean_names()
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new names
names(linelist)
##  [1] "case_id"         "generation"      "infection_date"  "date_onset"     
##  [5] "hosp_date"       "date_of_outcome" "outcome"         "gender"         
##  [9] "hospital"        "lon"             "lat"             "infector"       
## [13] "source"          "age"             "age_unit"        "fever"          
## [17] "chills"          "cough"           "aches"           "vomit"

Manual variable name cleaning

Re-naming variables manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”, the new variable name is given before the old variable name.

linelist <- linelist_raw %>%
    
    # standardize variable name syntax
    janitor::clean_names() %>% 
    
    # manually re-name variables
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the variables names have been changed:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Select variables

Often the first step of cleaning data is selecting the variables you want to work with, and their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.

Within select() you can do the following:

  • Select only the variables you want to remain, and their order of appearance
# linelist dataset is piped through select() command, and prints just the variable names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names()
## [1] "case_id"              "date_onset"           "date_hospitalisation"
## [4] "fever"
  • You can tell select() which variables to remove by placing a minus symbol “-” in front of the variable name (or a vector of variable names). It will keep all others. Inside select() you can use normal operators such as : for consecutive values, c() to list values/variables, ! for opposite, & for AND, and | for OR.
linelist %>% 
  select(-c(fever:vomit)) %>% 
  names()
##  [1] "case_id"              "generation"           "date_infection"      
##  [4] "date_onset"           "date_hospitalisation" "date_outcome"        
##  [7] "outcome"              "gender"               "hospital"            
## [10] "lon"                  "lat"                  "infector"            
## [13] "source"               "age"                  "age_unit"
  • Re-order the variables, using everything() to signify all other variables not specified:
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever, everything()) %>% 
  names()
##  [1] "case_id"              "date_onset"           "date_hospitalisation"
##  [4] "fever"                "generation"           "date_infection"      
##  [7] "date_outcome"         "outcome"              "gender"              
## [10] "hospital"             "lon"                  "lat"                 
## [13] "infector"             "source"               "age"                 
## [16] "age_unit"             "chills"               "cough"               
## [19] "aches"                "vomit"

As well as everything() there are several special functions that work within select(), namely:

  • everything() - all other variables not mentioned

  • last_col() - the last column

  • starts_with() - matches to a specified prefix. Example: select(starts_with("date"))

  • ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))

  • contains() - variables containing a character string. Example: select(contains("time"))

  • matches() - to apply a regular expression (regex). Example: select(contains("[pt]al"))

  • num_range() -

  • any_of() - matches if variable is named. Useful if name might not exist. Example: select(any_of(date_onset, date_death, cardiac_arrest))

  • where() - applies a function to all variables and selects those which are TRUE

  • select as a standalone command TO DO

Fix classes

See section on object classes

Here we want to ensure that the class of each variable is appropriate, so we’ll add it to our cleaning pipe chain.

The class of the “age” variable is character. To perform analysis, we need those numbers to be recognized as numeric!

class(linelist$age)
## [1] "character"

The class of the “date_onset” variable is also character! To perform analysis, these dates must be recognized as dates!

class(linelist$date_onset)
## [1] "character"

Use table() or another method to see all the values, can see that we see that one date was entered in a different format (15 April 2014) than all the others!

## 
## 15 April 2014    2014-04-07    2014-04-21    2014-04-25    2014-04-26 
##             1             1             2             1             1 
##    2014-04-27 
##             1

This means before we can classify “date_onset” as a date, we must fix this value. We can do this using mutate() and recode() in our cleaning pipe chain, before the commands to convert to class Date. LINK TO CLASSIFYING VARIABLE AS DATE.

The new mutate line can be read as: mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE. Note that this pattern (OLD = NEW) is the opposite of most R patterns. The R development community is working on revising this.

linelist <- linelist_raw %>%
    
    # standardize variable name syntax
    janitor::clean_names() %>% 
    
    # manually re-name variables
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 

# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  # fix incorrect values
                                        # old value       # new value
  mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>% 
  
  # correct the class of the variables
  mutate(age           = as.numeric(age),
         date_onset    = as.Date(date_onset, format = "%Y-%m-%d"))

Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! The date format = entered is often a source of problems.

Fix class for multiple variables at once

class(linelist$date_infection)
## [1] "POSIXct" "POSIXt"
head(linelist$date_infection)
## [1] "2014-04-09 UTC" NA               NA               "2014-05-07 UTC"
## [5] NA               "2014-05-06 UTC"

You can use The dplyr function across() with mutate() to convert several variables at once to a new class. across() allows you to specify which columns you want a function to apply to. Below, we want to mutate the variables where is.POSIXct() is TRUE, and apply the function is.Date() to them, in order to convert them to class “date”.

  • Note that within across() we also use the function where().
  • Note that is.POSIXct is from the package lubridate. Others (is.character(), is.numeric(), and is.logical() are from base R)
  • Note that the functions in across() are written without empty parentheses
linelist <- linelist %>% 
  mutate(across(where(lubridate::is.POSIXct), as.Date))

Filter rows

After selecting columns, a typical cleaning step is to filter the dataframe for specific rows using the dplyr verb filter()

Standalone command

filter(dataset, criteria) OR subset like: dataset_new <- dataset[criteria,criteria]

Creating and modifying variables

We advise creating new variables with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use the base R style to create a new variable.

Using dplyr verbs

As explained in the section on dplyr and tidyverse coding style (LINK HERE), a chain of ‘verb’ functions operate on a dataset through ‘pipes’ (%>%), passing the output from one verb to the next. The verb mutate() used to add a new variable or modify an existing one.

Below are some example of creating new variables with mutate(). The syntax is: new_variable_name = value or function. It is best practice to separate each new variable with a comma and new line.

linelist <- linelist %>%                   # creating new, or modifying old dataset
  mutate(new_var_dup    = case_id,                  # new variable = duplicate/copy another variable
         new_var_static = 7,                   # new variable = all values the same
         new_var_static = new_var_static + 5,  # you can overwrite a variable, and can modify a variable multiple times
         new_var_calc   = (age / 12),          # new variable = a calculation
         new_var_paste  = paste0(hospital, " (", date_hospitalisation, ")") # new variable = pasting together values from other variables
         ) 

Scroll to the right to see the new variables:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

TIP: The verb transmute() adds new variables just like mutate() but also drops/removes all other variables that you do not mention.

Recode a variable

To recode the values in a variable, mutate() is also used.

In this case, we notice that we need to clean the variable “hospital”. There are several incorrect spelling, and many missing values.

table(linelist$hospital, useNA = "always")
## 
##                            Connaught Hopital 
##                                           47 
##                           Connaught Hospital 
##                                         1715 
##                             Military Hopital 
##                                           30 
##                            Military Hospital 
##                                          786 
##                             Mitylira Hopital 
##                                            1 
##                            Mitylira Hospital 
##                                           79 
##                                        other 
##                                          885 
##  Princess Christian Maternity Hopital (PCMH) 
##                                           11 
## Princess Christian Maternity Hospital (PCMH) 
##                                          411 
##                               Rokupa Hopital 
##                                           11 
##                              Rokupa Hospital 
##                                          443 
##                                         <NA> 
##                                         1469

For to change spellings one-by-one, you can use the recode() function within the mutate function. The code is saying that the variable “hospital” should be defined as the current variable “hospital” but with certain changes (syntax is OLD = NEW). Don’t forget commas!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                           # OLD = NEW
                           "Mitylira Hopital"  = "Military Hospital",
                           "Mitylira Hospital" = "Military Hospital",
                           "Military Hopital"  = "Military Hospital",
                           "Connaught Hopital" = "Connaught Hospital",
                           "Rokupa Hopital"    = "Rokupa Hospital",
                           "other"             = "Other",
                           "Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
                           ))

table(linelist$hospital, useNA = "always")
## 
##                           Connaught Hospital 
##                                         1762 
##                            Military Hospital 
##                                          896 
##                                        Other 
##                                          885 
## Princess Christian Maternity Hospital (PCMH) 
##                                          422 
##                              Rokupa Hospital 
##                                          454 
##                                         <NA> 
##                                         1469

TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space inbetween ("").

If you need to use logic statements with multiple variables, or want to use operators like %in%, use case_when() instead, as below. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic!

linelist <- linelist %>% 
  mutate(hospital = case_when(hospital == "Connaught Hopital"                           ~ "Connaught Hospital",
                              hospital == "Rokupa Hopital"                              ~ "Rokupa Hospital", 
                              hospital %in% c("Mitylira Hopital", 
                                             "Mitylira Hospital",
                                             "Mitylira Hospital",
                                             "Military Hopital")                        ~ "Military Hospital",
                              is.na(hospital)                                           ~ "Missing",
                              hospital == "Princess Christian Maternity Hopital (PCMH)" ~ "Princess Christian Maternity Hospital (PCMH)",
                              TRUE                                                      ~  hospital) 
)
table(linelist$hospital, useNA = "always")
## 
##                           Connaught Hospital 
##                                         1762 
##                            Military Hospital 
##                                          896 
##                                      Missing 
##                                         1469 
##                                        Other 
##                                          885 
## Princess Christian Maternity Hospital (PCMH) 
##                                          422 
##                              Rokupa Hospital 
##                                          454 
##                                         <NA> 
##                                            0

To change missing values to a character value, such as “Missing”, use the function replace_na() in the same manner as recode above:

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

Likewise you can quickly convert character values to NA using na_if(), as below:

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

logical re-coding

Intro For simple cases you can use ifelse() or if_else(). In most cases it is better to use case_when().

ifelse() and if_else():
These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if TRUE, value if FALSE). if_else() is a special version from dplyr that handles dates.

Stringing together ifelse statements - NOT ADVISED!! Difficult to read and keep track of.

IMAGE of ifelse string with X across is.

Use case-when() instead.

You can reference other variables with the ifelse() function within mutate():

linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))
linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))

Missing if… na_if() lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),

coalesce()

if_else(), ifelse()

recode CLEANING MISSPELLINGS HOSPITAL NAME

Replace

Using cleaning dictionaries

## load cleaning rules and only keep variables in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
  filter(variable %in% c(names(mll_raw), ".global"))

## define variables that are not cleand
unchanged <- c(
  "epilink_relationship",
  "narratives",
  "epilink_relationship_detail"
)

mll_clean <- mll_raw %>%
  ## convert to tibble
  as_tibble() %>%
  ## clean variables using cleaning rules
  clean_data(
    wordlists = mll_cleaning_rules,
    protect = names(.) %in% unchanged
  )

Mutating grouped data

Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html

Taken from website above:

Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:

starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:

starwars %>%
  select(name, mass, species) %>%
  group_by(species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.

Using base R

If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new variable by assigning it a value. In the command below, the variable new_var does not exist until after the command is executed. In this simple example the variable is assigned the static value “new value”, so for all rows the value will be “new value”.

linelist_raw$new_var <- "new value"

You can also give the new variable a dyanmic value as shown below, or using the case_when() command explained in the next tab.

# mathematical operators on another variable
linelist_raw$new_var <- linelist_raw$OLD_var * 4

Groups by condition (case_when())

TODO tutorial on using case_when()

Numeric groups

For example, creating age groups cut()

case_when()

age_categories() (R4Epis package)

by percentile

WHAT TO DO IF AGE IS SPREAD ACROSS TWO VARAIBLES (e.g. numeric age + unit)

Highest in hierarchy

Within a group, indicate/convert to the highest value in the group

Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)